Estimating individual treatment effect on disability progression in multiple sclerosis using deep learning

Disability progression in multiple sclerosis remains resistant to treatment. The absence of a suitable biomarker to allow for phase 2 clinical trials presents a high barrier for drug development. We propose to enable short proof-of-concept trials by increasing statistical power using a deep-learning predictive enrichment strategy. Specifically, a multi-headed multilayer perceptron is used to estimate the conditional average treatment effect (CATE) using baseline clinical and imaging features, and patients predicted to be most responsive are preferentially randomized into a trial. Leveraging data from six randomized clinical trials (n = 3,830), we first pre-trained the model on the subset of relapsing-remitting MS patients (n = 2,520), then fine-tuned it on a subset of primary progressive MS (PPMS) patients (n = 695). In a separate held-out test set of PPMS patients randomized to anti-CD20 antibodies or placebo (n = 297), the average treatment effect was larger for the 50% (HR, 0.492; 95% CI, 0.266-0.912; p = 0.0218) and 30% (HR, 0.361; 95% CI, 0.165-0.79; p = 0.008) predicted to be most responsive, compared to 0.743 (95% CI, 0.482-1.15; p = 0.179) for the entire group. The same model could also identify responders to laquinimod in another held-out test set of PPMS patients (n = 318). Finally, we show that using this model for predictive enrichment results in important increases in power.
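
The enrichment step described in the abstract (rank patients by predicted CATE, then preferentially include the most responsive fraction) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `select_top_responders`, the example values, and the convention that a larger predicted CATE means a larger expected benefit are all assumptions.

```python
import math


def select_top_responders(predicted_cate, fraction):
    """Return indices of the patients with the largest predicted benefit.

    Assumption: a larger predicted CATE means a larger expected benefit.
    Indices are returned in ranking order (most responsive first).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = math.ceil(fraction * len(predicted_cate))
    ranked = sorted(
        range(len(predicted_cate)),
        key=lambda i: predicted_cate[i],
        reverse=True,
    )
    return ranked[:n_keep]


# Hypothetical predictions for four patients (illustrative values only).
top_half = select_top_responders([0.1, 0.5, -0.2, 0.3], 0.5)
```

With `fraction=0.5` or `fraction=0.3`, the returned indices would correspond to the 50% and 30% subgroups whose hazard ratios are reported above.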

from the text as it stands how many ORATORIO patients have been identified in this responder group. From supplementary figure 1, it looks as if this group has fewer than 30 patients. The issue of sample size in the progressive MS trials is therefore not addressed. The responder group in laquinimod (I assume this is the ARPEGGIO trial) is again small (14 vs 5 by the end of year 2), and the confidence intervals (supplementary figure 2) are too wide to provide reliable estimates.
It is odd that in a clinical trial with a proven treatment effect (ocrelizumab) the hazard ratio is 0.36, whereas in a clinical trial of a failed treatment (laquinimod) the HR for the 30% most likely responders is 0.28. It is also unexpected that in the ORATORIO trial the HR of disability progression for the 50th percentile of responders is not significant. In the original publication (Montalban et al., 2018, New England Journal of Medicine), the HR for the whole group was significant. I would therefore expect the half of patients predicted by the authors' model to be most likely to respond to also have a significant HR. It is hard to convince the reader that the presented results are reliable and in line with the literature.
Computer code and a statement on its availability should be added. This is important for reproducibility.
Reviewer #2 (Remarks to the Author): Comments for "Estimating treatment effect for individuals with progressive multiple sclerosis using deep learning" The authors have addressed my comments on the previous version of the manuscript. I have a couple of additional comments.
Comment:
1) Page 5/ Table 3: Is there a way to compare the values for the ADCwabc across the models? I understand the rank ordering of the values, but I do not have much intuition regarding the magnitude of the differences between the models. This would help to understand whether the models or subgroups with higher values are slightly or much different from the models or subgroups with lower values.
2) Page 5: I believe the p-value for the first hazard ratio in Section 2.3 is incorrect. Also, should the ADCwabc in Section 2.3 match the value from Table 3 (0.0208 vs 0.0211)?
3) Page 6: You state that the prognostic MLP model is second best in the Anti-CD20-Abs dataset, but it seems that it is fourth after the two other versions of MLP and T2 lesion volume/ disease duration.
4) Page 6: You state, "indicating no improvement in treatment effect compared to the whole group" for both Anti-CD20-Abs and laquinimod, but the HR for the laquinimod subgroup is smaller than either of the HRs you identified in your predictive enrichment groups from section 2.3 (0.305 vs. 0.492 and 0.338).
5) Page 6: Could you provide more details regarding your sample size calculation? I tried to reproduce your calculation and I got similar sample size calculations, but I would like to know the exact approach so it could be reproduced.
6) Page 13: Was the ridge regression model fit with the EDSS slope outcome?
7) Table 1: Could you check the RMST? It seems too close to 2 since it is calculated at 24 months.
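
On the sample size question (comment 5), one widely used way to size a time-to-event trial is Schoenfeld's approximation for the number of events needed to detect a given hazard ratio. The text does not state which method the authors used, so the sketch below is an assumption offered only to show what a reproducible calculation might look like; the function name `schoenfeld_events` and the default values for `alpha`, `power`, and `alloc` are illustrative.

```python
import math
from statistics import NormalDist


def schoenfeld_events(hr, alpha=0.05, power=0.8, alloc=0.5):
    """Events needed for a two-sided log-rank test (Schoenfeld approximation).

    hr    : hazard ratio to detect (must differ from 1)
    alpha : two-sided type I error rate
    power : desired power (1 - type II error rate)
    alloc : fraction of patients randomized to the treatment arm
    """
    if hr <= 0 or hr == 1:
        raise ValueError("hr must be positive and different from 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    events = (z_alpha + z_beta) ** 2 / (alloc * (1 - alloc) * math.log(hr) ** 2)
    return math.ceil(events)


# Compare the whole-group HR with the enriched 50% subgroup HR from the abstract.
events_whole = schoenfeld_events(0.743)
events_enriched = schoenfeld_events(0.492)
```

The formula returns a number of events, not patients; converting it to a sample size additionally requires an assumed event probability over the trial duration, which is not given here.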
Reviewer #3 (Remarks to the Author): The authors have carefully considered the feedback from the reviewers, and have amended the manuscript as appropriate. I have no further comments.
Reviewer #1 (Remarks to the Author): The manuscript entitled "Estimating treatment effect for individuals with progressive multiple sclerosis using deep learning" has undergone major revisions, and some of the prior points have been addressed. However, several major issues remain, and some contradictions with the literature in predicted treatment response have been introduced that are difficult to understand. I have elaborated below and hope that this will help the manuscript to improve even further:
The study is not about progressive MS. Instead, revisions have added RRMS patients (<800 individuals). While this is commendable, the models are trained and assessed on a majority RRMS population (understandably so given that RRMS is much more common, but this contradicts the claims on finding specific progressive MS models). In phase one of training, all patients are RRMS. Additionally, the model performs much better for patients with disability levels that are usually not considered progressive MS (EDSS<4.5). Thirdly, the most important factor (if I understand Table 3 correctly) is the T2 lesion volume. For these reasons, I suggest changing the title and the focus throughout from PMS to MS in general. This, however, raises a question about the novelty of this study in understanding mechanisms of action, given the number of prior studies on RRMS (Dr Rio's treatment response scores, Dr Sormani's prior work, and several other studies) that have identified similar factors, and many more on disease activity, using transparent (as opposed to black-box) methods.
Response: We thank the reviewer for their comment, and we apologize if parts of our manuscript lacked clarity. First, we have added 2,520 new RRMS patients to our training dataset, not < 800. This is stated in Section 2.1, second paragraph. Second, the RRMS dataset is used only for pre-training; the model is then fine-tuned and tested solely on the progressive MS subset, which is the main focus of our work. This "transfer learning" strategy was implemented to mitigate the issue of small sample size in progressive MS patient datasets, a point previously brought up by one reviewer. Pre-training on a related dataset is a common technique in machine learning for increasing the robustness of learned representations when dealing with smaller sample sizes. This was already explained in Section 2.1, second paragraph. Third, while it is true that secondary progressive MS patients usually have higher EDSS scores, this is not the case for primary progressive MS patients, who progress from the onset of their disease (EDSS can range from 0 to 10). Since our evaluation dataset includes patients with primary progressive MS, the fact that our model performs better in those with EDSS < 4.5 is not problematic.
However, we appreciate that the work involved both RRMS and progressive MS, and that similar pathophysiological processes occur in both clinical subtypes. We therefore agree to change the title and focus throughout the paper to be about progression in MS in general. We have also clarified which sub-type of MS is used for which phase of our analysis in the abstract and main text (Section 2.1).
Ditto, as suggested by other reviewers, I suggest removing claims on finding specific mechanisms and revising those claims to find individuals' characteristics that contribute to a more beneficial treatment response in light of the new analyses. This issue has been overlooked.

Response:
We make no claim about finding specific mechanisms in our manuscript. The other reviewers do not suggest removing any specific mechanistic claim from this version of the manuscript (please refer to their comments below). In the absence of a more specific request, we are therefore under the impression that no further changes are needed.
The revision has brought forward one major issue. There is a lack of uniform image processing, which was not clear in the previous version, but adding a standardization has changed almost all the important factors related to treatment response. I sympathize with the fact that this important study brings together older trials that used different scanners and were acquired at different times; however, a minimum expectation is to use a uniform image processing pipeline to reduce confounders and focus on biological questions. This had caused a significant discrepancy in the pre-revision version, in which disease activity differed between the training and test sets. In the revision, however, the expectation is to use a known and validated harmonization method based on image processing techniques rather than simply normalizing values, which will not address the fundamental issue of trial confound (that is, using different image processing techniques in trials that may have been acquired decades apart). The scaling factor that was calculated and used has not been validated before. I suggest redoing this part by processing MRI scans using a uniform tool for all trials. The response to the review brings this forward and is unsatisfactory: "ARPEGGIO used a different segmentation algorithm with a different sensitivity for lesions, and the number/size of lesions identified was smaller than the other trials. The model, therefore, was unlikely to generalize what it had learned about the lesion metrics to this dataset. We have now corrected this by scaling all the segmentation-related metrics to a common range (described in the Methods)."

Response:
We appreciate the reviewer's comment on harmonization, which made us realize that our previous response to this question was unclear and could have led to confusion.
The segmentation-based metrics, particularly the T2 lesion volumes, used in this study are derived from the ground-truth lesion masks, which were generated independently (by an image analysis centre outside of this study) during the course of each clinical trial.
A fully manual or a semi-automatic segmentation strategy was used during clinical trial analysis for each trial. This analysis began with automated segmentation and was followed by manual correction by experts. The automated segmentation algorithms were proprietary to the image analysis centre performing the measurements and are not available to external investigators (such as ourselves). In addition, there are well-known differences in the approach to determining lesion boundaries (school-effects) that can result in substantial differences in lesion volumes between reading centres. Thus, the lesion masks we used are the best approximation we have to ground truth, but they would not be expected to be identical across experts and reading centres.
Repeating this process using a consistent but different image processing pipeline and different expert annotators would not necessarily result in "better" segmentation masks, and is impractical as this type of work is extremely labour intensive and originally cost millions of dollars to perform. It is highly unlikely that any other investigator would have the resources to reproduce our work or adapt it to new data if we were to do this. The approach we have taken is much more practical in that it standardizes the range of input data to account for these school-effects, and therefore only requires access to the lesion counts/volumes that were generated during a clinical trial, thus making it feasible for other investigators to use this approach on other datasets obtained under different circumstances.
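
The range standardization described above can be sketched as a simple min-max rescaling of one trial's lesion metric onto a reference range taken from the training data. The exact procedure is described in the Methods and is not reproduced here, so this is only an assumed illustration; the function name `rescale_to_reference` and the example values are hypothetical.

```python
def rescale_to_reference(values, ref_min, ref_max):
    """Linearly map `values` so their range matches [ref_min, ref_max].

    Assumption: the reference range comes from the training dataset, and each
    trial's feature is rescaled independently to absorb reading-centre effects.
    """
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no range information; map to the midpoint.
        return [(ref_min + ref_max) / 2 for _ in values]
    scale = (ref_max - ref_min) / (hi - lo)
    return [ref_min + (v - lo) * scale for v in values]


# Hypothetical lesion volumes from one trial, mapped to a reference range of 0 to 20.
rescaled = rescale_to_reference([0, 5, 10], 0.0, 20.0)
```

Because only summary lesion counts and volumes are needed, a rescaling of this kind requires no access to the original images or segmentation software.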
In fact, scaling the range (and/or shifting the mean) of a feature's input distribution to a reference obtained from a training dataset is common practice in machine learning and serves to improve model optimization dynamics.

The revision has not fully addressed the issue of sample size in progressive MS. The title refers to predicting treatment effect in progressive MS, yet new revisions have added 803 subjects to the relapsing-remitting MS population. My comment and that of the editors in the prior revision have not been addressed. Changing the title and editing the manuscript, however, should address this.